FEAT - Adding `decimal` as parameter for `ToFloat32` #1772

gabrielapgomezji · 2025-11-24T16:19:57Z

This PR addresses #1728.

The `ToFloat` transformer now includes a `decimal` parameter that lets the user specify the decimal separator for a given column. All other possible thousands separators are then removed, and the decimal separator is converted to `.` before the column is passed to `to_float32`.

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

skrub/_to_float.py

emassoulie · 2025-12-02T09:31:54Z

skrub/tests/test_to_float.py

+        (",56", 0.56, ","),
+    ],
+)
+def test_number_parsing(input_str, expected_float, decimal, df_module):


Might it be worth adding tests for the code's behaviour in case of an invalid entry?

Yes, we should check a few weird cases and make sure they fail as expected.

rcap107 · 2025-12-02T13:54:35Z

After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release.

The current implementation removes every possible thousands separator other than the one specified as the "decimal" separator, which is quite risky and may lead to problems. It's better to follow what pandas does, i.e., expose both decimal and thousands as parameters. By default, the thousands separator should be None (so no replacement).

If there is some kind of weird string like `1,2.3,4`, it should not be parsed as a number. I am not sure how far we should go to parse something like `1,2.34` with decimal `.` and thousands `,`: it's not a format I recognize, but it would still be recognized as 12.34 rather than being rejected.

Another check that may be considered is counting the number of decimal separators and rejecting any case where there is more than one.
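The behaviour discussed above can be illustrated with a minimal sketch. It is not the PR's implementation: `parse_number` is a hypothetical helper, and the choice to return `None` for rejected strings is an assumption made only for illustration. It takes separate `decimal` and `thousands` arguments (with `thousands=None` meaning no replacement), rejects strings with more than one decimal separator, and rejects a thousands separator that appears after the decimal separator. Grouping is deliberately not checked, so `1,2.34` is still read as 12.34, which is the lenient case mentioned above.

```python
def parse_number(text, decimal=".", thousands=None):
    """Hypothetical helper: return a float, or None if the text is rejected."""
    # Reject strings with more than one decimal separator.
    if text.count(decimal) > 1:
        return None
    integer_part, sep, fractional_part = text.partition(decimal)
    if thousands is not None:
        # A thousands separator after the decimal separator is never valid.
        if thousands in fractional_part:
            return None
        # Lenient: group sizes are not checked, so "1,2" becomes "12".
        integer_part = integer_part.replace(thousands, "")
    # Rebuild the number with "." as the decimal separator.
    normalized = integer_part + ("." + fractional_part if sep else "")
    try:
        return float(normalized)
    except ValueError:
        return None


print(parse_number("1,234.56", thousands=","))  # accepted
print(parse_number("1,2.3,4", thousands=","))   # rejected
print(parse_number("1,2.34", thousands=","))    # accepted (lenient grouping)
print(parse_number("1,234.56"))                 # rejected: thousands is None
```

Rejecting `1,2.34` as well would require checking that every group after the first has exactly three digits.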

Some additional comments:

While it's impossible to test all possible scenarios, tests should also include as many weird edge cases as we can come up with to see what could be the result.
The ToFloat docstring needs some more work to explain in more detail the behavior when decimal and thousands are set.

I'll convert this back to draft and keep an eye on this for the next PR.

gabrielapgomezji · 2025-12-17T11:24:40Z

When we discussed the tests, three were suggested:

A test for good inputs
A test for bad inputs
A test for bad parameters

I merged the last two, so the bad-input test also covers bad parameters. If it's better to have three separate tests instead of two, I will modify it.

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

rcap107 · 2025-12-17T13:13:05Z

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

+During ``fit``, |ToFloat| attempts to convert all values in the column to
+numeric values after automatically removing other possible thousands separators
+(``,``, ``.``, space, apostrophe). If any value cannot be converted, the column
+is rejected with a ``RejectColumn`` exception.


I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.

There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)
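As a rough illustration of the checks listed above, here is a sketch of a strict pattern built from the two separators. It is not the regex used in the PR: `build_number_re` is a hypothetical name, and treating parentheses as an accounting-style wrapper around an unsigned number is an assumption, since the thread does not say what the parentheses check does.

```python
import re


def build_number_re(decimal=".", thousands=None):
    """Hypothetical helper: compile a strict pattern for a single number."""
    d = re.escape(decimal)
    if thousands:
        t = re.escape(thousands)
        # Either plain digits, or 1-3 digits followed by groups of exactly 3.
        integer = rf"(?:\d+|\d{{1,3}}(?:{t}\d{{3}})+)"
    else:
        integer = r"\d+"
    # Integer part with an optional fraction, or a bare fraction such as ",56".
    mantissa = rf"(?:{integer}(?:{d}\d*)?|{d}\d+)"
    # Optional scientific notation, e.g. "e-3".
    exponent = r"(?:[eE][+-]?\d+)?"
    body = rf"{mantissa}{exponent}"
    # Accept a signed number, or an unsigned one wrapped in parentheses.
    return re.compile(rf"(?:[+-]?{body}|\({body}\))")


strict = build_number_re(decimal=".", thousands=",")
for text in ["1,234.56", "1,,234", "1.23,45", "1,2.34", "(1,234)", "1.5e-3"]:
    print(text, bool(strict.fullmatch(text)))
```

Using `fullmatch` means any character other than digits, the sign, the exponent marker, parentheses, and the two separators causes the value to be rejected.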

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

skrub/_to_float.py

rcap107 · 2025-12-17T14:16:07Z

skrub/tests/test_to_float.py

+        ("1,,234", ".", ","),
+        ("1.23,45", ".", ","),
+        # decimal == thousand
+        ("123,456,789", ",", ","),


Here we are testing that RejectColumn is raised as expected when it encounters values that should not be converted. This case should be moved to a separate test that verifies that the correct exception is raised if the parameters are incorrect. The same (new) test should also check that a ValueError is raised if decimal is None.
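A separate bad-parameter test along these lines might look like the sketch below. `check_separators` is a hypothetical stand-in for whatever validation `ToFloat` performs, and raising `ValueError` when `decimal` equals `thousand` is an assumption; the comment above only says that the "correct exception" should be verified for that case. The small `raises_value_error` helper replaces `pytest.raises` so that the sketch has no dependencies.

```python
def check_separators(decimal, thousand):
    """Hypothetical validator: raise ValueError for unusable separators."""
    if decimal is None:
        raise ValueError("decimal must be a string, got None")
    if decimal == thousand:
        raise ValueError("decimal and thousand separators must differ")


def raises_value_error(decimal, thousand):
    # Stand-in for pytest.raises(ValueError).
    try:
        check_separators(decimal, thousand)
    except ValueError:
        return True
    return False


def test_bad_parameters():
    # decimal is None
    assert raises_value_error(None, None)
    # decimal == thousand
    assert raises_value_error(",", ",")
    # A valid combination must not raise.
    assert not raises_value_error(",", " ")


test_bad_parameters()
```

Keeping this apart from the bad-input test separates two failure modes: a column that cannot be converted (`RejectColumn`) and a transformer that was configured incorrectly.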

rcap107 · 2025-12-17T14:18:28Z

Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale.

My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in a good shape, it's just a matter of polishing at this point.

rcap107 · 2025-12-18T10:48:44Z

skrub/_to_float.py

+        if self.thousand is None:
+            self.thousand = ""  # No thousand separator


This should be moved to the init; parameters should not be modified in fit.
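The comment asks for the default to be handled at construction time. Another way to leave the parameter untouched during `fit`, shown here only as an assumption and not as the PR's code, is to resolve `None` into a separate fitted attribute. `SeparatorStripper` and `thousand_` are hypothetical names used for illustration.

```python
class SeparatorStripper:
    """Hypothetical stand-in for a transformer with a ``thousand`` parameter."""

    def __init__(self, thousand=None):
        # Store the parameter exactly as the caller passed it.
        self.thousand = thousand

    def fit(self, values):
        # Resolve None to "" in a separate attribute, so the user-facing
        # parameter still reads None after fitting.
        self.thousand_ = "" if self.thousand is None else self.thousand
        return self

    def transform(self, values):
        if not self.thousand_:
            return list(values)
        return [value.replace(self.thousand_, "") for value in values]


stripper = SeparatorStripper().fit(["1 000"])
print(stripper.thousand, repr(stripper.thousand_))
```

Either way, the value the user passed is still what the object reports after fitting.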

rcap107 · 2026-01-20T13:13:44Z

skrub/_to_float.py

+def _str_is_valid_number_polars(col, number_re):
+    # Check if all values in the column match the number regex.
+    # - Fill NaN values with empty string to avoid match errors.
+    # - Use `str.match` with `na=False` to treat empty/missing values as non-matching.


Suggested change

# - Use `str.match` with `na=False` to treat empty/missing values as non-matching.

# - Use `str.contains` with `literal=False` to treat empty/missing values as non-matching.

rcap107

A few more cosmetic fixes, but I think we're almost done here. Thanks @gabrielapgomezji

rcap107 · 2026-01-20T13:15:17Z

skrub/_to_float.py

+        strings to floats. Other possible decimal separators are removed from
+        the strings before conversion.


not the case anymore

Suggested change

strings to floats. Other possible decimal separators are removed from

the strings before conversion.

strings to floats.

rcap107 · 2026-01-20T13:16:12Z

skrub/_to_float.py

+    1    12300.0
+    Name: x, dtype: float32
+
+    It is possible to specify the thousands separator, e.g., to use " "


Suggested change

It is possible to specify the thousands separator, e.g., to use " "

It is possible to specify the thousands separator, e.g., to use ``" "``

rcap107 · 2026-01-20T13:16:49Z

skrub/_to_float.py

+
+    It is possible to specify the thousands separator, e.g., to use " "
+    >>> s = pd.Series(["4 567,89", "12 567,89"], name="x")
+    >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS


ELLIPSIS is enabled by default

Suggested change

>>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS

>>> ToFloat(decimal=",", thousand=" ").fit_transform(s)

rcap107 · 2026-01-20T14:20:04Z

I did a very quick and dirty benchmark comparing the performance of the ToFloat transformer in the latest version of skrub, and the ToFloat in this PR.

I generated a dataframe with 10M rows and 30 columns.

There is a small, but noticeable difference in time when there is no conversion to be done, so we might want to add a condition where the formatting check is skipped in the default case (decimal="." and thousands=None).

skrub version: 0.8.dev0
Elapsed time for ToFloat transformation: 0.3026 seconds
skrub version: 0.7.1
Elapsed time for ToFloat transformation: 0.2914 seconds
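The shortcut suggested above could be sketched as follows. This is an illustration under assumptions, not the PR's code: `validate_values` is a hypothetical helper, and it assumes the format check is a regex pass over string values.

```python
import re


def validate_values(values, number_re, decimal=".", thousand=None):
    """Hypothetical helper: return True if the column may be converted."""
    if decimal == "." and thousand is None:
        # Default case: skip the regex pass and let the float conversion
        # itself surface any bad value.
        return True
    return all(number_re.fullmatch(value) is not None for value in values)


comma_decimal = re.compile(r"\d+(?:,\d+)?")
print(validate_values(["1,5", "2"], comma_decimal, decimal=","))
print(validate_values(["1.5"], comma_decimal, decimal=","))
```

With the default arguments the function returns before touching the data, so the regex pass would cost nothing in that case.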

Code

import time
from importlib import metadata

import numpy as np
import polars as pl
from skrub import ApplyToCols, ToFloat

version = metadata.version("skrub")
print(f"skrub version: {version}")

# Set random seed for reproducibility
np.random.seed(42)

# Generate random float data: 10M rows, 30 columns
data = np.random.uniform(1000, 1000000, size=(10_000_000, 30))

# Build a polars DataFrame of float columns. The values are not converted
# to strings, so ToFloat has nothing to parse here.
df = pl.DataFrame(data)


def benchmark_tofloat(df):
    # Time one fit_transform call of ToFloat applied to every column.
    tic = time.perf_counter()
    ApplyToCols(ToFloat()).fit_transform(df)
    return time.perf_counter() - tic


# Run the benchmark 100 times and report the median
times = [benchmark_tofloat(df) for _ in range(100)]
elapsed_time = np.median(times)
print(f"Elapsed time for ToFloat transformation: {elapsed_time:.4f} seconds")

rcap107 changed the title ~~WIP: Adding decimal conversion and tests~~ FEAT - Adding decimal as parameter for ToFloat32 Nov 24, 2025

rcap107 mentioned this pull request Nov 24, 2025

FEAT - adding a heuristic for parsing units in string columns #1726

Draft

gabrielapgomezji marked this pull request as ready for review December 1, 2025 14:03

rcap107 reviewed Dec 1, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

emassoulie reviewed Dec 2, 2025

rcap107 marked this pull request as draft December 2, 2025 13:54

rcap107 reviewed Dec 17, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

rcap107 marked this pull request as ready for review December 17, 2025 13:00

rcap107 reviewed Dec 17, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

rcap107 reviewed Dec 17, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

rcap107 requested changes Dec 17, 2025

rcap107 reviewed Dec 18, 2025

rcap107 linked an issue Dec 18, 2025 that may be closed by this pull request

ToFloat fails when trying to parse numbers with "," decimal separators #1728

Open

ggomezji added 13 commits January 19, 2026 15:04

WIP: Adding decimal conversion and tests

e5fe91f

Added tests and examples

c186201

Added doctest skip

67f00c7

Added documentation

47cc97d

Added elipsis on doctests

daa9557

Fixed example doc

292a5c1

Improved users guide

6821b32

Fixed tests

9df1ba2

WIP: Improved column verification

620bd12

WIP: Removed pattern and include thousand separator

0be30f3

WIP: Regex modification for polars

ec7d687

Improved tests

3e6dea1

Improving the docstrings and documentation

f8e63a6

ggomezji and others added 10 commits January 19, 2026 15:08

Improving documentation

50d9b47

Update doc/modules/column_level_featurizing/feature_engineering_numer…

36d1d8f

…ical.rst Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

Update doc/modules/column_level_featurizing/feature_engineering_numer…

0f149f1

…ical.rst Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

Update doc/modules/column_level_featurizing/feature_engineering_numer…

415aec1

…ical.rst Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

Update skrub/_to_float.py

a19e149

Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

Update skrub/_to_float.py

1754d07

Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

Update skrub/_to_float.py

db44c3e

Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

Update skrub/_to_float.py

0b71c26

Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>

WIP

94435da

Reverting changes and cleaning up history

095f403

rcap107 force-pushed the 1728-ToFloat_improvement branch from 5e1dd9b to 095f403 Compare January 20, 2026 10:55

rcap107 reviewed Jan 20, 2026
